# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Dec 8 20:32:27 2021"
“How are you feeling right now? What do you expect to learn? Where did you hear about the course?”
I am extremely excited about this course! I have just started my PhD studies at the Institute of Atmospheric and Earth System Research. My research topic is related to building a framework for climate competencies, based mostly on interviews and surveys. Since my background is in the natural sciences, I have little to no experience in the behavioural sciences and their methodology. For this reason I have chosen this course. It was suggested to me by our colleagues in the Faculty of Educational Sciences.
I expect to learn to use R, which I’m not too worried about since I have experience in Python, and even more importantly, the statistical practices and rules that apply in human sciences, when it’s not all about nature and numbers.
As text: https://github.com/joulasip/IODS-project.git
date()
## [1] "Wed Dec 8 20:32:27 2021"
The data includes survey results about learning from students on a statistics course in Finland. The data were collected in 2014-2015. After combining variables that measure the same dimension and excluding observations where the exam points are zero, the data consist of 166 observations and the following 7 variables:

- gender: M (male), F (female)
- age: age in years
- attitude: global attitude toward statistics
- deep: deep learning approach (combined score)
- stra: strategic learning approach (combined score)
- surf: surface learning approach (combined score)
- points: exam points
The three learning approaches combine 8 subscales that each include 4 questions.
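As a rough sketch, such combined scale scores can be formed by averaging the related question columns. The column names below (D03, D11, SU02, SU08) are hypothetical stand-ins, not the actual survey item names:

```r
# Hypothetical sketch of forming combined scales by averaging question columns;
# the column names here are made up, not the actual survey item names.
raw <- data.frame(D03 = c(4, 3), D11 = c(5, 2),    # deep-learning questions
                  SU02 = c(2, 4), SU08 = c(1, 3))  # surface questions
deep <- rowMeans(raw[, c("D03", "D11")])  # mean over the deep-learning questions
surf <- rowMeans(raw[, c("SU02", "SU08")])
deep  # -> 4.5 2.5
```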
Reading the data and checking the structure (the working directory being IODS-project):
data <- read.csv("data/learning2014", row.names = 1)
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Summary of the data — basic statistics of the variables:
summary(data)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Graphical overview using the libraries GGally and ggplot2. The first variable, gender, defines the colour, and alpha sets the transparency in the graphs so that overlapping data can be seen:
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The correlation coefficient, corr, refers to the Pearson correlation by default — correlations equal to +1 or −1 correspond to data points lying exactly on a line. The stars after the correlation values represent the significance levels of the p-values.
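A quick self-contained illustration of the ±1 cases with base R's `cor()`:

```r
x <- 1:10
cor(x, 2 * x + 3)  # points exactly on an increasing line -> 1
cor(x, -x)         # points exactly on a decreasing line -> -1
```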
There are almost twice as many female students as male students. The majority of the students are between 20 and 30 years old. Male students generally report a better attitude than female students.
Points and attitude have the highest positive correlation, 0.437, which is also significant (p < 0.001). The highest significant negative correlation is between the surface and deep learning approaches, corr = −0.324 (up to −0.622 for males). This is to be expected, since they could be considered opposites.
Strategic learning approach is slightly more common among female students than male.
Weaker negative correlations can be seen between the surface and strategic learning approaches, corr = −0.161, and between the surface learning approach and attitude, corr = −0.176.
The distribution of the deep learning approach is shifted more towards higher values than that of the strategic or surface approach: its mean and median are approx. 3.6, compared to approx. 3.1 for the strategic approach and 2.8 for the surface approach.
To me it seems odd that the points do not correlate with the deep learning approach at all. Could this mean that the exam/assignments do not measure deep learning? Or that the deep learning approach does not in fact lead to learning? There is more evidence (although no significance is shown) that the surface learning approach has some negative correlation with the points and the strategic approach a positive one.
Since the points have a highly significant correlation with attitude and slight correlations with the surface and strategic learning approaches, I have chosen those three as explanatory variables for the multiple regression with "points" as the target variable.
my_model_multi <- lm(points ~ attitude + stra + surf, data = data)
summary(my_model_multi)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The following formula describes the multiple linear model:
\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\)
where \(\beta_0\) is the intercept and \(\beta_1\), \(\beta_2\) and \(\beta_3\) are the regression coefficients (Estimate) for attitude, the strategic approach and the surface approach, respectively. These measure the change in the mean response associated with a change in the corresponding explanatory variable. \(\varepsilon_i\) refers to the error terms, which are assumed to be normally distributed with zero mean and the same variance \(\sigma^2\) for all values of the explanatory variables (as explained in the course material book Multivariate Analysis for the Behavioural Sciences). Note that the Std. Error column is a different quantity: it is the standard error of each coefficient estimate.
The residuals (the differences between the observed values of the target variable and the fitted values) have a median of approx. 0.52. The F-statistic tests the omnibus null hypothesis that all the regression coefficients are zero; it is calculated by dividing the regression mean square by the residual mean square (RMS), where the RMS gives an estimate of the variance \(\sigma^2\). Since the F-statistic here has a very low associated p-value (3.156e-08), it is very unlikely that all of the coefficients are zero — as can be expected.
The t values are obtained by dividing the estimated regression coefficients by their standard errors, and the associated significance levels (Pr(>|t|) in the table) indicate the importance of each explanatory variable.
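As a check on the summary above, the t value for attitude can be recovered directly from the Estimate and Std. Error columns:

```r
# t value = estimate / standard error (numbers from the summary table above)
estimate  <- 3.3952
std_error <- 0.5741
round(estimate / std_error, 3)  # -> 5.914 (the summary shows 5.913; the inputs here are rounded)
```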
According to the model, attitude has the strongest and most significant (p < 0.001) positive association with the points. Since only attitude is highly significant, I will try models with attitude and the surface approach, and attitude and the strategic approach, separately. The summary of the latter is shown below.
my_model_2 <- lm(points ~ attitude + stra, data = data)
summary(my_model_2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
\(R^2\) gives the proportion of variability in the target variable accounted for by the explanatory variables — in the first case approx. 21% of the variability in points is accounted for by these three explanatory variables: attitude, strategic learning approach and surface learning approach. \(R\) is a measure of the fit of the model, the multiple correlation coefficient.
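A small self-contained sketch (on simulated data, not the course data) of what Multiple R-squared measures, namely one minus the ratio of the residual sum of squares to the total sum of squares:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
all.equal(1 - rss / tss, summary(fit)$r.squared)  # -> TRUE
```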
In the model with only attitude and the strategic learning approach as explanatory variables (shown above), the strategic approach has a p-value of 0.09, still below 0.1, so it fares slightly better. In this model, according to \(R^2\), approximately the same amount of variability, 20%, is explained by attitude and the strategic approach together as in the previous model with three explanatory variables. This gives confidence that the surface learning approach contributes very little.
Let's still see how the model looks with only one explanatory variable, attitude. The resulting summary is shown below — attitude alone accounts for 19% of the variability in points.
my_model_3 <- lm(points ~ attitude, data = data)
summary(my_model_3)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
It is important to note that correlations between the explanatory variables can also have a major effect on the model, and they need to be taken into account when deciding which variables to keep and which to leave out! Here the surface and strategic learning approaches do have a small negative correlation according to the overview of the data (part 2), but it is much weaker than the correlation between attitude and points, so I do not consider it important.
I will continue the analysis with the model including attitude and strategic learning approach as the explanatory variables.
par(mfrow = c(2,2))
plot(my_model_2, which = c(1,2,5))
The graph above includes the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs. Leverage.
Assumptions of the model:

- the relationship between the target and the explanatory variables is linear
- the errors are normally distributed with zero mean
- the errors have constant variance \(\sigma^2\) and do not depend on the values of the explanatory variables
The QQ-plot allows exploring the normality assumption — the better the points fall on the line, the better the assumption holds. The fit here is rather good, especially in the middle, but there seem to be some outliers at both ends of the distribution that fall below the line.
The assumptions of constant error variance and independence from the explanatory variables can be assessed with the Residuals vs Fitted values plot. Any pattern in the plot would indicate issues with these assumptions. Here the plot shows no clear pattern, but the same outliers are marked in this plot as well — observations 145, 56 and 35 seem odd. Based on these two plots, I would look more deeply into those answers and consider removing those observations altogether if a clear reason for their behaviour is found.
The Residuals vs Leverage plot helps to identify observations that have an unusually high impact on the model. Here again a few observations stand out and are marked with numbers — 145 and 35 appear again. Since their Cook's distance is somewhat higher than that of the other observations, they have a rather large influence on the model and should be examined more closely.
Overall the diagnostic plots still look quite good and the model can be considered valid. The same outliers are also present if the diagnostic plots are made for the model with only attitude as the explanatory variable. Removing the outliers could lead to higher significance and clearer model results.
date()
## [1] "Wed Dec 8 20:32:33 2021"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(boot)
alc <- read.csv("data/alc", row.names = 1)
dim(alc)
## [1] 370 51
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The data represent Portuguese secondary school students' performance, including grades and demographic, social and school-related features. They were collected using school reports and questionnaires. Two subjects are combined here: Mathematics and Portuguese language. More information, including explanations of all the variables listed above, can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
'alc_use' combines the weekday (Dalc) and weekend (Walc) alcohol use ratings. 'high_use' is defined as TRUE if 'alc_use' > 2, so a student's alcohol consumption is considered high if the average of the two ratings is above 2.
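Presumably the wrangling step looked roughly like this — a sketch on toy data, not the actual wrangling script:

```r
library(dplyr)
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))  # toy values on the 1-5 scale
toy <- mutate(toy,
              alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
              high_use = alc_use > 2)        # TRUE when the average exceeds 2
toy$high_use  # -> FALSE TRUE
```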
Variables that I assume to have an important correlation with alcohol consumption, and my hypotheses about the connections:

- G3 (final grade): high alcohol use lowers the grades
- absences: high alcohol use increases absences
- goout (going out with friends): going out more goes together with higher alcohol use
- age: younger students use more alcohol
In addition, I will include gender as a divider in the graphs.
A glimpse of the selected data:
#select the wanted variables to check
sel_alc <- select(alc,one_of(c("G3","goout","absences","age")))
glimpse(sel_alc)
## Rows: 370
## Columns: 4
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14, 1…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 5…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1, …
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…
Bar plots of the chosen variables:
gather(sel_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
A small number of absences is common, and only very few students have more than 10 absences. Most students are between 15 and 18 years old and only a few are older. The grades seem to follow a normal distribution that is somewhat shifted towards better scores — 10 and 12 are the most common grades. Going out with friends is also quite normally distributed — most students responded with 3 or 2.
Making box plots of the variables with high use on the x-axis and gender defining the colour:
library(cowplot)
# initialize the plots for the 4 variables chosen
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + xlab("high use") + ggtitle("Student grade by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student absences by alcohol consumption and sex")
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out") + xlab("high use") + ggtitle("Going out by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student age by alcohol consumption and sex")
plot_grid(g1,g2,g3,g4)
Producing summary statistics for each selected variable based on high use of alcohol and gender:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
Grades: I hypothesized that the grades would be negatively affected by alcohol use. According to the box plot and the mean grades shown above, this holds only for male students — male students without high alcohol use have a higher average grade than female students (high use or not), and male students with high alcohol use have a notably lower average grade than female students. The box plot shows that some male students with high alcohol consumption have received very low grades, which pulls the average down quite a lot. For female students the variance is smaller and no outliers are present (see the box plot).
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 4.25
## 2 F TRUE 41 6.85
## 3 M FALSE 105 2.91
## 4 M TRUE 70 6.1
Absences: I assumed that higher alcohol use would lead to more absences. For both genders this seems to hold. Male students without high alcohol use have a very low average of absences, only 2.9, so the difference to the high users is very clear, approx. 3 more days of absence. For female students the difference is approx. 2.6 days. Female students have more absences in general, and there are several students with a much higher number of absences than the average — all the way up to more than 40.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_goout
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 2.95
## 2 F TRUE 41 3.39
## 3 M FALSE 105 2.70
## 4 M TRUE 70 3.93
Going out: The box plot for going out looks a bit odd due to the Likert scale. The median among the students using less alcohol is 3, and among those using it more than twice a week it is 4 (both male and female). Here the hypothesis also holds. The difference between the mean values is larger for male students according to the summary statistics above. Students going out more also consume more alcohol, which makes a lot of sense.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_age
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 16.6
## 2 F TRUE 41 16.5
## 3 M FALSE 105 16.3
## 4 M TRUE 70 16.9
Age: The mean age of those using alcohol many times a week is slightly lower for female students and higher for male students than that of the non-high users. This shows clearly in the medians in the box plot. Female students may start drinking earlier than male students. It seems to hold true to some extent that younger students drink more — however, the students are generally quite young, as shown in the bar plots earlier.
Statistically exploring the data with logistic regression:
m <- glm(high_use ~ age + absences + G3 + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + absences + G3 + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8967 -0.7525 -0.5409 0.9467 2.2946
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.77834 1.96186 -1.926 0.05412 .
## age 0.04563 0.11022 0.414 0.67890
## absences 0.07315 0.02237 3.270 0.00108 **
## G3 -0.04472 0.03923 -1.140 0.25435
## goout 0.70668 0.11917 5.930 3.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.67 on 365 degrees of freedom
## AIC: 398.67
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age absences G3 goout
## -3.77834234 0.04562870 0.07314823 -0.04471930 0.70667982
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age 1.04668570 0.8430071337 1.299954
## absences 1.07589001 1.0312932049 1.127070
## G3 0.95626587 0.8852281993 1.032878
## goout 2.02724925 1.6139414455 2.577757
Odds ratios (OR) describe the odds of high alcohol use associated with each explanatory variable. If the value is over 1, the explanatory variable has a positive effect on the target variable (high use). In this case, age, absences and going out increase the chances of high alcohol use, while grades have a negative association with it. However, as the significance tests in the model summary show, only absences and going out are statistically significant in the model.
If the value 1 falls within the confidence interval (CI), there is no evidence of an association between the variables. In this case, only the confidence intervals of absences and going out do not include the value 1, which means that only those two variables are associated with high alcohol use.
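The odds ratios are simply the exponentiated coefficients. For example, using the goout estimate from the model summary above:

```r
b_goout <- 0.70668
exp(b_goout)  # -> about 2.03: each one-point increase in goout roughly
              # doubles the odds of high alcohol use
```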
Regarding the earlier hypotheses, this agrees with the previous analysis based on the box plots. Absences have a clear positive correlation with high alcohol use, as does going out with friends. Based on the earlier summary statistics, this model would look different regarding the grades if it were fitted for male students separately, whose grades were notably lower if they were using alcohol more than twice a week.
For this exploration I will include the variables 'absences' and 'goout' according to the analysis above. Here I will use the model to predict the actual values, and then save and compare the predictions to the actual values. If the predicted probability is larger than 0.5, high use is predicted as TRUE.
m2 <- glm(high_use ~ absences + goout, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, goout, high_use, probability, prediction) %>% tail(10)
## absences goout high_use probability prediction
## 361 7 3 TRUE 0.28963176 FALSE
## 362 3 3 TRUE 0.23104046 FALSE
## 363 2 1 TRUE 0.06029226 FALSE
## 364 4 4 TRUE 0.40315734 FALSE
## 365 3 2 FALSE 0.12606082 FALSE
## 366 4 3 TRUE 0.24487648 FALSE
## 367 0 2 FALSE 0.10291931 FALSE
## 368 4 4 TRUE 0.40315734 FALSE
## 369 8 4 TRUE 0.47825016 FALSE
## 370 0 2 FALSE 0.10291931 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 236 23
## TRUE 65 46
The table of the last ten values shows that the model predicted all of these cases as not having high alcohol consumption based on absences and going out, even though 7 of the 10 actual observations have high alcohol use. This model is clearly not perfect. Still, the table of target variable vs. predictions shows that most cases are predicted correctly by the model.
Then I plot the predictions against the actual values (graphical visualization). If the model were perfect at predicting the actual values, the upper line would have points only on the right side (probability > 0.5) and the lower line only on the left side.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63783784 0.06216216 0.70000000
## TRUE 0.17567568 0.12432432 0.30000000
## Sum 0.81351351 0.18648649 1.00000000
According to the table, approximately 24 % of the predictions went wrong (the training error, the sum of the off-diagonal proportions).
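The training error is the sum of the two off-diagonal cells in the proportion table above, i.e. the wrongly predicted FALSEs plus the wrongly predicted TRUEs:

```r
# false positives + false negatives, read from the proportion table above
0.06216216 + 0.17567568  # -> about 0.238, i.e. roughly 24 % of predictions wrong
```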
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459
The average proportion of wrong predictions in the cross-validation is around 0.24-0.26 (it changes with each run). Therefore the model's prediction performance is very close to the model at DataCamp.
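The run-to-run variation comes from the random fold assignment; fixing the seed makes a single CV result reproducible. A self-contained sketch on simulated data (not the course data):

```r
library(boot)
set.seed(42)                        # fixes the random fold assignment
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(toy$x))        # binary outcome driven by x
loss <- function(class, prob) mean(abs(class - prob) > 0.5)
fit <- glm(y ~ x, data = toy, family = "binomial")
cv.glm(data = toy, cost = loss, glmfit = fit, K = 10)$delta[1]
```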
END OF WEEK 3!
date()
## [1] "Wed Dec 8 20:32:36 2021"
The data represent Housing Values in Suburbs of Boston from the 1970s. Loading and taking a look at the data:
library(dplyr)
library(ggplot2)
library(tidyr)
library(boot)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.92 loaded
# load the data
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506 14
There are 506 observations and the following 14 variables:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for large lots
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: a transformation of the proportion of the Black population by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
Looking at the summary of the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Making a correlation matrix and visualizing it:
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Based on the visual overview, the strongest negative correlations are between:

- nox and dis (−0.77): nitrogen oxides concentration and distance to the employment centres
- age and dis (−0.75): proportion of old buildings and distance to the employment centres
- lstat and medv (−0.74): share of lower-status population and median home value

According to this, it sounds like the employment centres produce nitrogen oxide (NO) emissions, which makes sense if they are based on heavy industry. The second is also clear: areas with older buildings were built close to the workplaces, and newer buildings are further away since the nearby space is already occupied. A larger share of lower-status population unsurprisingly correlates with a lower value of owner-occupied homes.
And the strongest positive correlations are between:

- rad and tax (0.91): accessibility to radial highways and property-tax rate
- indus and nox (0.76): share of non-retail business acres and nitrogen oxides concentration

These statements are also reasonable: good access to highways goes together with a higher property-tax rate, and industrial (non-retail) businesses emit nitrogen oxides.
#centre and standardize variables
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
The mean of each variable is now zero and its standard deviation one, since standardizing subtracts the column mean from each variable and divides by its standard deviation.
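A self-contained check that `scale()` indeed subtracts the mean and divides by the standard deviation:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)          # standardize by hand
all.equal(as.numeric(scale(x)), manual)  # -> TRUE
```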
Next I will create a factor variable of the crime rate and remove the old one, and split the data into train and test sets for testing predictions.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# removing the old crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
# adding the new categorized crime rate variable to the standardized dataset
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows
n <- nrow(boston_scaled)
# choosing randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# creating the train set
train <- boston_scaled[ind,]
# ... and the test set
test <- boston_scaled[-ind,]
Then I will fit the linear discriminant analysis to the train data with crime rate as the target variable and all the other variables as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2673267 0.2500000 0.2301980 0.2524752
##
## Group means:
## zn indus chas nox rm age
## low 1.01365809 -0.9020188 -0.126510540 -0.8773833 0.42708690 -0.9115852
## med_low -0.08861018 -0.3519789 0.039520456 -0.5763419 -0.08170405 -0.3685727
## med_high -0.43099276 0.3180964 0.193349455 0.4174183 -0.07261634 0.4141706
## high -0.48724019 1.0171096 -0.002135914 1.0481898 -0.40179899 0.7976710
## dis rad tax ptratio black lstat
## low 0.9186185 -0.6968812 -0.7146801 -0.48969511 0.37513211 -0.76244089
## med_low 0.3082570 -0.5338554 -0.4509952 -0.04165789 0.32386115 -0.18029794
## med_high -0.3830739 -0.3866442 -0.2426354 -0.13243585 0.04203384 0.05939468
## high -0.8352580 1.6382099 1.5141140 0.78087177 -0.79350846 0.88685880
## medv
## low 0.52065193
## med_low 0.05015199
## med_high -0.02823585
## high -0.65125598
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12107054 0.666620929 -0.9420107
## indus 0.03864901 -0.478369303 -0.2264823
## chas -0.06610835 -0.050144462 0.2070847
## nox 0.39951497 -0.531165626 -1.3726847
## rm -0.07932483 -0.060757004 -0.1078376
## age 0.26306297 -0.267049735 -0.2171279
## dis -0.07292728 -0.175042425 -0.1094167
## rad 2.99518933 0.930812416 -0.2561549
## tax -0.01831158 0.072771367 0.9635695
## ptratio 0.13394016 0.030206484 -0.1712937
## black -0.12061832 0.001206622 0.1040602
## lstat 0.23946734 -0.110242026 0.4224073
## medv 0.17355932 -0.213025370 -0.1307237
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9478 0.0396 0.0126
The following produces the LDA biplot:
# making arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the results with arrows
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
LD1 is the first linear discriminant and LD2 the second. What these exactly represent hopefully becomes clearer next week. The arrows show the direction and extent of the effect of each explanatory variable: accessibility to radial highways (rad) seems to be the strongest indicator of a high crime rate.
The correct classes from the test data are saved and then removed here:
# saving the correct classes from test data
correct_classes <- test$crime
# lastly remove the crime variable from test data
test <- dplyr::select(test, -crime)
Then I will use the LDA model for predicting the classes, and cross tabulate the results with the categories from the test set.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 7 0 0
## med_low 2 18 5 0
## med_high 2 10 21 0
## high 0 0 0 25
The category ‘high’ is very well predicted by the model. The observations belonging to the category ‘low’ are often predicted as ‘med_low’. In general the model seems pretty good, but due to the randomness of the train/test split, the model changes somewhat with every run and therefore the results vary as well. Averaging over many runs would give a more precise estimate of the model’s performance.
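As a sketch of that averaging idea, the split-fit-predict steps above can be repeated and the test accuracy averaged, reusing the categorized boston_scaled data prepared earlier (the exact numbers will vary from run to run):

```r
# Average LDA test accuracy over many random 80/20 splits of the
# categorized, standardized Boston data (boston_scaled from above).
library(MASS)

accs <- replicate(100, {
  ind   <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
  train <- boston_scaled[ind, ]
  test  <- boston_scaled[-ind, ]
  fit   <- lda(crime ~ ., data = train)
  pred  <- predict(fit, newdata = test)$class
  mean(pred == test$crime)   # accuracy of this one split
})
mean(accs)  # a more stable estimate than a single split
```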
I will load the Boston data again and standardize it, after which I will calculate the distances between the observations.
data('Boston')
boston_scaled2 <- scale(Boston)
#making the data frame out of the data
boston_scaled2 <- as.data.frame(boston_scaled2)
#calculating euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Next the K-means algorithm with 3 clusters is run and the results visualized (only columns 6-10 are plotted to see the results better):
km <-kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)
#finding the optimal number of clusters:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
Around x = 2 in the previous plot the steep drop in the total within-cluster sum of squares ends, and therefore the optimal number of clusters is 2. This produces better-separated groups:
km <-kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)
The clusters are clearest in the variables rad (accessibility to radial highways) and tax (property tax rate). There is a well-separated group of observations that all share the same, very high value of rad, and the same goes for tax. These same observations (the red ones in the graph) appear close to each other in the other variable pairs as well. They also seem to have a high proportion of old buildings and a short distance to the employment centres.
The other variables not presented in the graph were checked but did not provide additional information for the interpretation of the results.
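One way to back this interpretation up numerically is to look at the cluster sizes and each variable's cluster mean, a small sketch reusing km and boston_scaled2 from above:

```r
# Characterize the two clusters: how many observations each holds, and
# each variable's per-cluster mean in standardized units (large positive
# or negative means mark the variables that separate the clusters).
table(km$cluster)
round(aggregate(boston_scaled2, by = list(cluster = km$cluster), FUN = mean), 2)
```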
END OF WEEK 4!
date()
## [1] "Wed Dec 8 20:32:39 2021"
The data includes 155 observations and the following 8 variables related to human development globally:
I will start by reading the data and looking at the graphical overview and summary of the variables:
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
#library(boot)
library(corrplot)
library(FactoMineR)
# load the data
human <- read.csv("data/human.csv", header = TRUE, stringsAsFactors = TRUE, row.names = 1)
# summaries of the variables
summary(human)
## edu2R labR LEB EYE
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI MMR ABR PRP
## Min. : 2.00 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 53.50 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 99.00 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 98.73 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.:143.50 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :194.00 Max. :1100.0 Max. :204.80 Max. :57.50
# graphical overview with
p <- ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
p
A few observations from the data overview:
The same observations can also be made from the correlation matrix presented below.
#plotting the correlation matrix
cor_matrix <- cor(human) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Performing a principal component analysis (PCA) on the data:
pca_human <- prcomp(human)
# making a summary of the PCA
s <- summary(pca_human)
# printing rounded percentages of variance captured by each principal component
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 92.3 6.0 1.4 0.3 0.0 0.0 0.0 0.0
# making the labels include the percentages
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1),
col = c("cornsilk4", "darkgoldenrod"),
xlab = pc_lab[1],
ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The PCA biplot of the human development data shows the first two principal components, which together cover over 98% of the variation in the data. Gross National Income mostly affects PC2, and maternal mortality ratio is the strongest contributor to PC1. The other variables have very small loadings, so their arrows are barely visible.
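This is expected: without scaling, prcomp() works on the raw covariances, so the columns with the largest variances dominate the leading components. A minimal check, reusing the human data loaded above:

```r
# Without scaling, PCA follows the raw column variances, so the
# large-scale variables (here presumably MMR, given its range up to
# 1100) dominate the first principal components.
sort(sapply(human, var), decreasing = TRUE)
```

An equivalent alternative to standardizing by hand is calling prcomp(human, scale. = TRUE) directly.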
Next the variables will be standardized and the same PCA analysis done again.
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
# making a summary of the PCA
s_ <- summary(pca_human_std)
# printing rounded percentages of variance captured by each principal component
pca_pr_ <- round(100*s_$importance[2, ], digits = 1)
pca_pr_
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 48.3 16.2 12.3 9.4 6.1 3.6 2.7 1.4
# making the labels include the percentages
pc_lab_ <- paste0(names(pca_pr_), " (", pca_pr_, "%)")
# draw a biplot
biplot(pca_human_std,
cex = c(0.8, 1),
col = c("cornsilk4", "darkgoldenrod"),
xlab = pc_lab_[1],
ylab = pc_lab_[2])
The results look quite different. Standardizing brings all the variables to the same scale, which makes it easier to see the differences and correlations between them and their influence on the principal components. The PCA biplot of the scaled human development data shows the first two principal components, which together explain approximately 64% of the total variation in the data. After standardizing, more principal components carry a noticeable share of the variance (shown in the percentages above) than without standardizing. In general it does not really make sense to interpret the results of a non-standardized PCA when the variables are measured on very different scales.
The arrows pointing in the same direction as a principal component axis represent the features contributing to that PC. Arrows with a small angle between them represent variables that are strongly correlated with each other. The length of an arrow reflects the standard deviation of the variable: the longer the arrow, the higher the standard deviation.
The features defining the first principal component are therefore expected years of education, the ratio of women to men with secondary education, life expectancy at birth and Gross National Income, which are all rather highly correlated with each other, as well as maternal mortality ratio and adolescent birth rate, which are strongly correlated with each other as well. The features pointing in opposite directions also correlate with each other, but negatively. In conclusion, PC1 tells about the level of education and about giving birth.
The second principal component is mostly defined by the proportion of women in parliament and the ratio of women to men in the labour force, so it tells something about gender equality in working life.
Then we can look at the principal components together and in relation to the observations (countries). The lower half of the graph shows countries with fewer women in power, the upper half countries with more. The countries on the left have high life expectancy and a good level of education, while the countries on the right are not as developed in terms of education or maternal health. It follows that countries with high education levels and a high proportion of women in parliament are located in the upper left part of the graph and can be considered the most developed by these indicators. At the opposite end of the graph, in the lower right, are countries with high maternal mortality, higher adolescent birth rates and fewer women in power; these can be said to be the least developed.
This analysis complements the descriptive analysis based on the correlation matrix and offers more insight into the data: PCA helps in seeing overarching themes rising from it.
Next, a look at the tea data and a Multiple Correspondence Analysis (MCA) on it:
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) # observations 300, variables 36
## [1] 300 36
# selecting the following columns to keep for the analysis
keep_columns <- c("Tea", "How", "how", "sophisticated", "where", "spirituality")
# creating the new dataset based on the wanted variables
#FOR SOME REASON THIS PART IS CRASHING WHEN I KNIT THE index FILE, BUT WORKS IF I ONLY KNIT chapter5 FILE...
#tea_time <- select(tea, one_of(keep_columns))
#MCA on the data
#mca <- MCA(tea_time, graph = FALSE)
#making a plot about the results
#plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Since I had problems with knitting the index file even though the script works fine when knitting only the chapter5 file, I will attach a picture of the results instead:
MCA plot of the selected variables of the tea data.
From the visualization we can see that using tea bags for making the tea goes hand in hand with buying the tea from a chain store, and similarly using unpacked tea and shopping in specific tea shops are very close. These make a lot of sense since most loose-leaf teas are sold in tea shops.
Black tea is close to sophisticated and milk usage rather close to not-sophisticated. Green tea and using lemon both stand out from the other variables.
These are maybe not the most interesting variables for the Multiple Correspondence analysis, and some others could have been chosen instead.
END OF WEEK 5!
date()
## [1] "Wed Dec 8 20:32:45 2021"
First we will take a look at data about rats and their nutrition. The data consists of the body weights of three groups of rats measured over a 9-week period (64 days). The data was already converted from wide to long format in the data wrangling part of the exercise. We start by reading the data in and checking its structure; the variables ID and Group are changed to factors and a summary of the data is shown. After that the data is plotted so that each group of rats gets its own panel.
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
#library(boot)
#library(corrplot)
#library(FactoMineR)
# load the data (saved in long form)
RATSL <- read.table("data/RATS.txt", header = TRUE)
#wide format (original data) for later use
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep = "\t", header = T)
#checking the data structure
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
# making factors out of ID and Group variables
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# summaries of the variables
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
# drawing the data, one group in one graph
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "none")
These graphs already show that the groups have rather different weights. Group 1 is the lightest and group 3 the heaviest overall. Group 2 has one rat that is, throughout the time series, notably heavier than the others, even the group 3 rats. Not surprisingly, the weight of all the rats increases during the 9-week test period.
Next the data will be standardized and same graph is produced again.
# standardizing the data
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/ sd(Weight)) %>%
ungroup()
# drawing the standardized data
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Standardized weight") +
theme(legend.position = "none")
Now the growing trend is not so clear anymore.
Then a summary graph of the data will be produced.
# number of days, baseline (day 1) included
n <- RATSL$Time %>% unique() %>% length()
# summary data with mean and standard error of Weight by Group and time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# taking a glimpse at the summary data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
# plotting the mean profiles with errorbars
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.4)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
All of the lines are clearly separated and do not overlap, which suggests a difference between the groups.
Then we take a summary measure approach to look into the differences between the groups, producing boxplots of the mean values of each group.
# creating a summary data by Group and ID with mean as the summary variable (excluding the baseline)
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# draw a boxplot of the mean versus Group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
scale_y_continuous(name = "mean(Weight), 64 days")
From the graph above we can see that one of the mean values in Group 2 is notably higher than the others; this was visible already in the graphs including all the data. Let’s leave that outlier out of the analysis to avoid the bias it might cause in the conclusions.
# create a new data set by filtering out the outlier and draw the plot again
RATSL8S1 <- RATSL8S %>% filter(mean < 570)
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
scale_y_continuous(name = "mean(Weight), 64 days")
After removing the outlier, none of the groups’ weight distributions overlap with the others. To check the validity of the conclusion that there is a significant difference between the groups, we fit a linear regression model with the baseline (day-1 weight) and the group as explanatory variables for the mean weight, and compute an analysis of variance.
# add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
# fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
# compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis shows that the baseline is strongly associated with the mean weight, which makes sense, and that the group effect is only borderline: p = 0.076 just misses the conventional 0.05 threshold, but still hints at a difference.
Now we have a new data set at hand. It contains 40 male subjects who received one of two treatments for eight weeks, during which they were rated on the brief psychiatric rating scale (BPRS): once before the treatment and then weekly. The scale is used to evaluate whether or not a patient has schizophrenia.
First we will take a look at the data, change variables treatment and subject into factors, and make a graph showing the time series of the two treatment groups.
# load the data (saved in long form)
BPRSL <- read.table("data/BPRS.txt", header = TRUE)
#checking the data structure
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# making factors out of treatment and subject variables
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# summaries of the variables
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
# graph showing all the data
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
There seems to be a downward trend in the BPRS during the study for almost all the patients. Lower values in the beginning also seem to lead to smaller values in the end. However, the differences between the subjects are large.
It’s time to look at some different models of the data. First we will fit an ordinary linear regression model, without taking into account the repeated-measures nature of the data.
# create a regression model RATS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
This model is not to be trusted since we know that there is very likely correlation between different measurements of the same subject. This correlation can be taken into account by a random factor.
To account for this we can use a random intercept model. The random component is assumed to be normally distributed and be constant in time. This allows the linear fit for each subject to differ in intercept from other subjects.
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The estimated standard error of week is smaller than in the ordinary linear regression model, which comes down to the within-subject correlation now being taken into account. This is still not good enough, since a random intercept model often fails to represent the observed pattern of variances and correlations between repeated measurements in longitudinal data.
To allow heterogeneity in both intercepts and slopes, we can use a random intercept and random slope model. The two random effects are assumed to have a bivariate normal distribution.
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the chi-squared statistic given by the likelihood ratio test (anova) is fairly small (p ≈ 0.026), which indicates that the random intercept and random slope model provides a better fit for the data.
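For reference, the chi-squared statistic reported by anova() is just twice the difference of the maximized log-likelihoods, which can be verified by hand reusing the two fitted models:

```r
# Likelihood ratio statistic: 2 * (logLik of larger model - logLik of
# smaller model), compared against a chi-squared distribution whose
# degrees of freedom equal the difference in parameter counts (here 2).
chisq <- 2 * (as.numeric(logLik(BPRS_ref1)) - as.numeric(logLik(BPRS_ref)))
chisq                                      # about 7.27, matching the anova table
pchisq(chisq, df = 2, lower.tail = FALSE)  # about 0.026, the reported p-value
```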
The final model is meant to be a random intercept and random slope model that allows a treatment x week interaction. Note, however, that the formula below actually groups the random effects by treatment rather than by subject and contains no fixed interaction term, which likely explains the singular fit warning.
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | treatment), data = BPRSL, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | treatment)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2843.3 2870.5 -1414.7 2829.3 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8252 -0.7277 -0.2586 0.5697 4.0818
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## treatment (Intercept) 2.824e-02 0.16804
## week 1.765e-03 0.04201 -1.00
## Residual 1.516e+02 12.31302
## Number of obs: 360, groups: treatment, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.3664 33.996
## week -2.2704 0.2531 -8.971
## treatment2 0.5722 1.2979 0.441
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.741
## treatment2 -0.475 0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
# perform an ANOVA test on the two models - with interaction and without
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | treatment)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref2 7 2843.3 2870.5 -1414.7 2829.3
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 97.896 0
Here anova gives us 0 degrees of freedom, and therefore no p-value: both models have the same number of parameters (7), so the likelihood ratio test cannot be formed. My conclusion is that this model is not useful for these data.
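For comparison, a model that actually includes the fixed week:treatment interaction while keeping the subject-level random effects could be sketched as follows (hypothetical object name BPRS_ref3; not fitted in this report):

```r
# Sketch: fixed week * treatment interaction with a random intercept
# and slope per subject (rather than per treatment as in BPRS_ref2).
# Assumes lme4 is loaded and BPRSL is prepared as above.
library(lme4)

BPRS_ref3 <- lmer(bprs ~ week * treatment + (week | subject),
                  data = BPRSL, REML = FALSE)

# Compared against BPRS_ref1, this gives a proper 1-df likelihood
# ratio test of the interaction term.
anova(BPRS_ref3, BPRS_ref1)
```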
Let’s therefore use the previous model, the random intercept and random slope model, to compute fitted values and plot them next to the observed values for visual comparison.
# draw the plot of RATSL with the observed Weight values
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "none") +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both)
# create a vector of the fitted values from the wanted model
Fitted <- fitted(BPRS_ref1)
# create a new column fitted to RATSL
BPRSL <- BPRSL %>% mutate(fitted = Fitted)
# draw the plot of RATSL with the Fitted values of weight
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
geom_line() +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "none") +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both)
I also tried plotting the fitted values from the first model, with only a subject-level random intercept, but that leads to all the fitted lines being parallel (since no slope randomness is allowed). The model including the interaction, in turn, did not produce anything useful, as shown earlier. In the end, the best model is the one plotted here, even though it is not perfect either. In general, I would conclude that linear models are perhaps not the best fit here, and a more complicated correlation structure might be found. What can be said for sure is that there is no clear evidence of either treatment working better than the other (or differently in any way, for that matter).
Thank you for the course!
THE END!